ABSTRACT

As the volume of RDF data grows, a distributed database system
becomes essential for managing it. In distributed RDF data design, it
is common to partition the RDF data into parts, called fragments,
which are then distributed across sites. The distribution design thus
consists of two steps: fragmentation and allocation. In this paper, we
propose a method that exploits the intrinsic similarities among the
structures of queries in a workload for fragmentation and allocation,
with the aim of reducing the communication cost during SPARQL query
processing. Specifically, we mine and select frequent access patterns
that reflect the characteristics of the workload. Based on the selected
frequent access patterns, we propose two fragmentation strategies,
vertical and horizontal, that divide RDF graphs while meeting
different query processing objectives. Vertical fragmentation targets
higher throughput, and horizontal fragmentation targets a lower
response time for a single query. After fragmentation, we discuss how
to allocate the fragments to sites. Finally, we discuss how to process
a query based on the results of fragmentation and allocation. Extensive
experiments confirm the superior performance of our proposed solutions.

1. INTRODUCTION 

As a standard model for publishing and exchanging data on the
Web, the Resource Description Framework (RDF) has been widely
used in various applications to expose, share, and connect pieces
of data on the Web. In RDF, data is represented as triples of the
form (subject, property, object). An RDF dataset can be naturally
seen as a graph, where subjects and objects are vertices connected
by named relationships (i.e., properties). SPARQL is a structured
query language proposed by the W3C to access RDF repositories.
Answering a SPARQL query Q is equivalent to finding subgraph
matches of the query graph Q over an RDF graph G. Figures 1 and 2
show an RDF graph and a set of SPARQL query graphs used as the
running example in this paper.

As RDF repositories increase in size, evaluating SPARQL queries
is beyond the capacity of a single machine. For example, DBpedia,
a project aiming to extract structured content from Wikipedia,
consists of 2.46 billion RDF triples; according to the W3C, some
commercial RDF datasets contain more than 1 trillion triples. The
large scale of RDF data increases the demand for high-performance
distributed RDF database systems.

In distributed database design, the first issue is “data
fragmentation and allocation”. We need to divide an RDF graph into
several parts, called fragments, and then distribute them among sites.
One important issue during data fragmentation and allocation in
a distributed system is how to reduce the communication cost
between different fragments during distributed query evaluation
(assuming different fragments reside at different sites). To
minimize the communication cost, many existing graph fragmentation
strategies optimize a global goal (such as min-cut). However,
evaluating a SPARQL query is a subgraph (homomorphism)
match problem. The subgraph match computation often does not
involve all vertices in graph G, and its communication cost depends
not only on the RDF graph but also on the query graph. In other
words, subgraph match computation exhibits strong locality. There is
no direct relation between minimizing the communication cost (in
subgraph match computation) and optimizing the global goal. Hence,
we propose a local pattern-based fragmentation strategy in this
paper, which can reduce the communication cost of subgraph match
computation.

The intuition behind local pattern-based fragmentation is as
follows: if a query “satisfies” a local pattern and all its matches are
in a single fragment, then the query can be evaluated on that single
fragment and no communication cost is needed to answer the
query. The key issue in local pattern-based fragmentation is how
to define the “local patterns”. Unlike existing methods,
we consider a query workload-driven “local pattern” definition.

1.1 Why Query Workload Matters?

Workload-driven distributed data fragmentation has been well
studied in relational databases. However, few RDF data
fragmentation proposals consider the query workload. We review the
related work separately. Here, we discuss why the query workload
is important for RDF data fragmentation.

We study one real SPARQL query workload, the DBpedia query
workload, which records 8,151,238 SPARQL queries issued over 14
days. For this workload, if we set the minimum support
threshold to 0.1% of the total number of queries (about 8,151
queries), we mine 163 frequent subgraph patterns. Most surprisingly,
97% of the query graphs are isomorphic to one of the 163 frequent
subgraph patterns. Thus, if we use these frequent subgraph patterns
as the basic fragmentation units, 97% of the SPARQL queries incur no
communication cost, since their matches reside in one fragment.

Figure 1: Example RDF Graph

Figure 2: Example SPARQL Query Graphs


1.2 Our Solution

Motivated by the above, we propose a workload-driven data
fragmentation for distributed RDF graph systems. Specifically, we
first mine frequent subgraph patterns, named frequent access
patterns, in the query workload. We treat these frequent access
patterns as the implicit schemas for the underlying RDF data. Then,
we propose two fragmentation strategies based on these implicit
schemas. We study the following technical issues in this paper.

Frequent Access Pattern Selection. Given a frequent access
pattern, we build a fragment by collecting all its matches in the RDF
graph. In this way, we can reduce the communication cost (i.e.,
improve query performance) if a SPARQL query satisfies the
frequent access pattern. However, if we simply select all frequent
access patterns as the implicit schemas, the space cost may become
expensive due to data replication, since different frequent
access patterns may share the same edges. In other words, there is
a tradeoff between performance gain and space cost when
selecting frequent access patterns. We formalize the frequent access
pattern selection problem (Section 4.1) and prove that it is
NP-hard (Theorem 1). Thus, we propose a heuristic algorithm
that guarantees both the data integrity and an approximation ratio
(Theorem 2). This algorithm also performs well in our experiments.

Vertical and Horizontal Fragmentation. Based on the selected
frequent access patterns (i.e., implicit schemas), we design two
fragmentation strategies, i.e., vertical and horizontal fragmentation.
These two strategies suit different query processing objectives.
The objective of the vertical fragmentation strategy is to improve
the query throughput; it requires that all structures involved in
one frequent access pattern be placed in the same fragment. In
contrast, the horizontal fragmentation strategy distributes the
structures involved in one frequent access pattern among different
fragments to maximize the parallelism of query evaluation, namely,
to reduce the response time of a single query. To perform
horizontal fragmentation over RDF graphs, we extend the concept of
“minterm predicate” from relational database design to “structural
minterm predicate” (see Section 5.2), which considers the structures
of both RDF graphs and workloads. Different applications have
different requirements, so we provide customizable options that can
be used for different RDF graphs and SPARQL query workloads.

Query Decomposition. Query decomposition always depends on the
fragmentation. In traditional vertical and horizontal fragmentation
in RDBMS and XML, the query decomposition is unique, since there is
no overlap between different fragments. As mentioned before, our
fragmentation strategies for RDF graphs replicate some data. Thus,
we may have multiple decomposition results for a query. We propose
a cost model-driven selection among them in this paper.

The contributions of this paper can be summarized as follows:

• We analyze the characteristics of a real SPARQL query
workload and use the intrinsic similarities of queries in the
workload to mine and select frequent access patterns
for distributed RDF data design. Although we prove that the
problem of frequent access pattern selection is NP-hard, we
propose a heuristic method that achieves good performance.

• Based on the above scheme, we propose two fragmentation
strategies, vertical and horizontal fragmentation, to divide
the RDF graph into many fragments, and a cost-aware
allocation algorithm to distribute fragments among sites. The
two fragmentation strategies provide customizable options
that adapt to different applications.

• We propose a cost-aware query optimization method to
decompose a SPARQL query and generate a distributed
execution plan. With the decomposition results and execution
plan, we can efficiently evaluate the SPARQL query.

• We run experiments over both real and synthetic RDF datasets
and SPARQL query workloads to verify our methods.

2. PRELIMINARIES 

In this section, we review the terminologies used in this paper 
and formally define the problem to be addressed. 

2.1 RDF and SPARQL 

RDF data can be represented as a graph according to the
following definition.

Definition 1. (RDF Graph) An RDF graph is denoted as $G =
\{V(G), E(G), L\}$, where (1) $V(G)$ is a set of vertices that correspond
to all subjects and objects in RDF data; (2) $E(G) \subseteq V(G) \times V(G)$ is
a set of directed edges that correspond to all triples in RDF data;
and (3) $L$ is a set of edge labels. For each edge $e \in E(G)$, its edge
label is its corresponding property.

Similarly, a SPARQL query can also be represented as a query
graph Q. For simplicity, we ignore FILTER statements in SPARQL
syntax in this paper.

Definition 2. (SPARQL Query) A SPARQL query is denoted
as $Q = \{V(Q), E(Q), L'\}$, where (1) $V(Q) \subseteq V(G) \cup V_{var}$ is a set
of vertices, where $V(G)$ denotes vertices in RDF graph $G$ and $V_{var}$
is a set of variables; (2) $E(Q) \subseteq V(Q) \times V(Q)$ is a set of edges in
$Q$; and (3) $L'$ is also a set of edge labels, and each edge $e$ in $E(Q)$
either has an edge label in $L$ (i.e., property) or the edge label is a
variable.

In this paper, we assume that $Q$ is a connected graph; otherwise,
all connected components of $Q$ are considered separately. Given
a SPARQL query $Q$ over RDF graph $G$, a SPARQL match is a
subgraph of $G$ that is homomorphic to $Q$. Thus, answering a
SPARQL query is equivalent to finding all subgraph matches of $Q$
over RDF graph $G$. The set of all matches for $Q$ over $G$ is denoted
as $\llbracket Q \rrbracket_G$.
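To make the notion of a match concrete, the following minimal sketch enumerates homomorphic matches by brute-force backtracking. The triple-list representation, the leading `?` for variables, the function names, and the toy data are all assumptions made for illustration; they are not the matching engine used in this paper.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, property, object)


def is_var(term: str) -> bool:
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")


def find_matches(query: List[Triple], graph: List[Triple],
                 binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield each variable binding that maps every query edge onto a graph edge."""
    binding = binding or {}
    if not query:
        yield dict(binding)
        return
    first, rest = query[0], query[1:]
    for edge in graph:
        extended = dict(binding)
        consistent = True
        for q_term, g_term in zip(first, edge):
            if is_var(q_term):
                # A variable must keep the value it was bound to earlier.
                if extended.setdefault(q_term, g_term) != g_term:
                    consistent = False
                    break
            elif q_term != g_term:
                consistent = False
                break
        if consistent:
            yield from find_matches(rest, graph, extended)


# Hypothetical toy data, not the RDF graph of Figure 1.
G = [
    ("Aristotle", "influencedBy", "Plato"),
    ("Aristotle", "mainInterest", "Ethics"),
    ("Plato", "influencedBy", "Socrates"),
    ("Plato", "mainInterest", "Ethics"),
]
Q = [("?x", "influencedBy", "?y"), ("?x", "mainInterest", "Ethics")]
matches = list(find_matches(Q, G))
```

Because two variables may bind to the same vertex, the sketch computes homomorphisms rather than isomorphisms, which is the semantics assumed above.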

In this work, we study a query workload-driven fragmentation.
A query workload $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_q\}$ is a set of queries that users
input in a given period.

2.2 Fragmentation & Allocation 

In this paper, we study an efficient distributed SPARQL query
engine. There are many issues related to distributed database
system design, but the focus of this work is “data fragmentation and
allocation” for RDF repositories. We formalize two important
problems as follows.

Definition 3. (Fragmentation) Given an RDF graph $G$, a
fragmentation $\mathcal{F}$ of $G$ is a set of graphs $\mathcal{F} = \{F_1, \ldots, F_n\}$ such that:
(1) each $F_i$ is a subgraph of $G$, called a fragment of RDF
graph $G$; (2) $E(F_1) \cup \ldots \cup E(F_n) = E(G)$; and (3) $V(F_1) \cup \ldots \cup
V(F_n) = V(G)$, where $E(F_i)$ and $V(F_i)$ denote the edges and
vertices in $F_i$ ($i = 1, \ldots, n$).

In our work, we allow overlaps between different fragments.
Given a fragmentation $\mathcal{F}$, the next issue is how to distribute these
fragments among different sites (i.e., computing nodes). This is
called allocation.

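Definition 4. (Allocation) Given a fragmentation $\mathcal{F} = \{F_1, \ldots,
F_n\}$ over an RDF graph $G$ and a set of sites $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$
(usually $m < n$), an allocation $\mathcal{A} = \{A_1, \ldots, A_m\}$ of fragments in
$\mathcal{F}$ to $\mathcal{S}$ is a partitioning of $\mathcal{F}$ such that (1) $A_j \subseteq \mathcal{F}$, where $1 \le j \le m$;
(2) $A_{j_1} \cap A_{j_2} = \emptyset$, where $1 \le j_1 \ne j_2 \le m$; (3) $A_1 \cup \ldots \cup A_m = \mathcal{F}$;
and (4) all fragments in $A_j$ are stored at site $S_j$, where $1 \le j \le m$.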
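The conditions of Definitions 3 and 4 can be checked mechanically. The sketch below is only an illustration under assumed representations: a graph is a set of (subject, property, object) triples, a fragment is a subset of those triples, and an allocation is a list holding one set of fragment identifiers per site. The function names are hypothetical.

```python
from typing import Hashable, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def vertices(edges: Iterable[Triple]) -> Set[str]:
    """All subjects and objects that appear in a set of edges."""
    return {v for s, _, o in edges for v in (s, o)}


def is_fragmentation(graph: Set[Triple], fragments: List[Set[Triple]]) -> bool:
    """Conditions (1)-(3) of Definition 3; fragments are allowed to overlap."""
    covered = set().union(*fragments)
    return (all(f <= graph for f in fragments)
            and covered == graph
            and vertices(covered) == vertices(graph))


def is_allocation(fragment_ids: Iterable[Hashable],
                  sites: List[Set[Hashable]]) -> bool:
    """Conditions (1)-(3) of Definition 4: the sites partition the fragments."""
    assigned = [f for site in sites for f in site]
    no_overlap = len(assigned) == len(set(assigned))
    return no_overlap and set(assigned) == set(fragment_ids)
```

Since the graph is represented by its edges only, vertex coverage follows from edge coverage here; the explicit vertex check is kept to mirror condition (3).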

Given an RDF graph $G$, a query workload $\mathcal{Q}$ and a distributed
system consisting of sites $\mathcal{S}$, the goal of this paper is to first
decompose $G$ into a fragmentation $\mathcal{F}$ and then find the allocation
$\mathcal{A}$ of $\mathcal{F}$ to $\mathcal{S}$.

3. OVERVIEW 

This paper studies a SPARQL query workload-driven data
fragmentation and allocation problem. Observations on real
query workloads show that some RDF properties are rarely accessed.
For example, few user queries contain properties such as
imageSkyline and wappen in Figure 1. Moreover, classical
distributed database design suggests an “80/20” rule, meaning
that the active 20% of query patterns account for 80% of the total
query input [24]. Therefore, we divide the whole RDF repository
into two parts: “hot graph” and “cold graph” as follows.

Definition 5. (Infrequent and Frequent Property) Given a
query workload $\mathcal{Q} = \{Q_1, \ldots, Q_q\}$, if a property $p$ occurs in fewer than
$\theta$ queries in $\mathcal{Q}$, where $\theta$ is a user-specified parameter, $p$ is an
infrequent property; otherwise, $p$ is a frequent property.


Figure 3: System Architecture


Definition 6. (Hot and Cold Graphs) Given an edge $e =
\overrightarrow{u_i u_j} \in E(G)$ with property $p$, if property $p$ is a frequent property, $e$
is a hot edge; otherwise, $e$ is a cold edge.

Given an RDF graph $G$, it is divided into two parts: hot graph $H$
and cold graph $C$, where $H$ consists of all hot edges and $C$ consists
of all cold edges.
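Definitions 5 and 6 translate directly into a counting pass over the workload. The following sketch assumes that queries and the graph are lists of (subject, property, object) triples and that the threshold is passed as `theta`; the function name `split_hot_cold` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]


def split_hot_cold(graph: List[Triple], workload: List[List[Triple]], theta: int):
    """Return (hot_edges, cold_edges) according to Definitions 5 and 6."""
    # Count, for each property, the number of queries it occurs in.
    occurrences = Counter()
    for query in workload:
        occurrences.update({prop for _, prop, _ in query})
    # A property seen in at least theta queries is frequent; its edges are hot.
    hot = [e for e in graph if occurrences[e[1]] >= theta]
    cold = [e for e in graph if occurrences[e[1]] < theta]
    return hot, cold
```

Each property is counted once per query, even if it labels several edges of that query, which matches the "occurs in fewer than $\theta$ queries" wording of Definition 5.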

The goal of this work is to partition the “hot graph” to
improve performance. We treat the cold graph as a “black
box”. The cold graph does not overlap with the hot graph, since the
two graphs contain edges with different properties.
Any existing approach can be used for
the cold graph. We only consider the cold graph during SPARQL
query processing, since some queries may involve
“infrequent” properties. Moreover, both the cold graph and the hot
graph may be disconnected.

Figure 3 illustrates our system architecture. In the offline phase,
we mine the frequent access patterns (see Section 4) in the
workload. Each frequent access pattern can correspond to one or more
fragments. Generating a fragment from all matches of a frequent
access pattern lets many queries be answered efficiently without
cross-fragment joins, but it may also replicate some hot edges
and increase the space cost. Thus, we should select an appropriate
subset of frequent access patterns to balance efficiency and
space cost. Since selecting an appropriate set of
patterns is an NP-hard problem (Section 4.1), we propose a heuristic
pattern selection solution that guarantees both the data integrity
and an approximation ratio. Based on the selected frequent
access patterns, we study two different data fragmentation strategies,
i.e., vertical and horizontal fragmentation (Section 5). Vertical
fragmentation aims to improve the query throughput, and
horizontal fragmentation aims to reduce a single query's response time.
Fragments are distributed among different sites. Meanwhile, we
maintain the metadata in a data dictionary.

In the online phase, we study how to decompose a query into
several subqueries on different fragments and generate an efficient
execution plan. A cost model for guiding the decomposition is
proposed. Finally, we execute the plan and return the
matches of the query (Section 7.3).

4. FREQUENT ACCESS PATTERNS

As mentioned before, we believe that a query often contains 
some patterns in the previously issued queries, so we mine some 
patterns with high access frequencies and use these patterns as the 
fragmentation units. Then, if a query Q can be decomposed to 
some subgraphs isomorphic to the frequent access patterns, Q can 
be answered while avoiding some joins across multiple fragments. 

Before we mine frequent access patterns, we first normalize the
query graphs in the workload to avoid overfitting. For each SPARQL
query, we replace all constants (strings and URIs) at subjects and
objects with variables. The FILTER expressions
are also removed. By doing this, we extract a general
representation of a SPARQL query from the workload. Figure 4 shows the
generalized query graphs of the query graphs in Figure 2. We assume
that the generalized query graphs in Figure 4 are also frequent
access patterns.
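The normalization step can be sketched as follows. The sketch assumes that a query is a list of (subject, property, object) triples, that variables start with `?`, and that equal constants are replaced by the same fresh variable so the query structure is preserved; the `?_c` prefix and the function name `generalize` are hypothetical choices, and FILTER handling is omitted.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def generalize(query: List[Triple]) -> List[Triple]:
    """Replace constants at subject and object positions with fresh variables."""
    fresh: Dict[str, str] = {}

    def to_var(term: str) -> str:
        if term.startswith("?"):
            return term  # already a variable
        # Reuse the same fresh variable when a constant occurs more than once.
        return fresh.setdefault(term, f"?_c{len(fresh)}")

    # Properties are kept as-is; only subjects and objects are generalized.
    return [(to_var(s), p, to_var(o)) for s, p, o in query]


example = [("?x", "influencedBy", "Plato"), ("?x", "mainInterest", "Ethics")]
generalized = generalize(example)
```

Two queries that differ only in their constants then map to the same generalized graph up to variable renaming, which is what allows them to support the same pattern during mining.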

To mine patterns with high access frequencies, we need to first 
count the number of queries in the workload where a pattern p is 
a subgraph. We define the frequent access pattern usage value to 
record the access frequencies of the frequent access patterns. 

Definition 7. (Frequent Access Pattern Usage Value) Given
a SPARQL query $Q$ and a frequent access pattern $p$, we associate
a frequent access pattern usage value, denoted as $use(Q, p)$ and
defined as follows:

$$use(Q, p) = \begin{cases} 1 & \text{if pattern } p \text{ is a subgraph of } Q \\ 0 & \text{otherwise} \end{cases}$$

Then, given a workload $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_q\}$ and a pattern $p$, we
define the access frequency, $acc(p)$, as the number of queries in $\mathcal{Q}$
of which $p$ is a subgraph:

$$acc(p) = \sum_{k=1}^{q} use(Q_k, p)$$

A pattern $p$ is a frequent access pattern if its access frequency is no
less than a threshold, $minSup$.
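As a minimal sketch of Definition 7, the code below computes $use(Q, p)$ and $acc(p)$ over generalized query graphs. It assumes graphs are lists of (subject, property, object) triples whose edge labels are constants, and it uses a naive backtracking subgraph test with an injective vertex mapping; the helper names `contains` and `frequent` are hypothetical, and a real system would rely on a frequent graph mining algorithm instead.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]


def contains(query: List[Triple], pattern: List[Triple],
             mapping: Optional[Dict[str, str]] = None) -> bool:
    """True if `pattern` is isomorphic to a subgraph of `query`."""
    mapping = mapping or {}
    if not pattern:
        return True
    (ps, pl, po), rest = pattern[0], pattern[1:]
    for qs, ql, qo in query:
        if pl != ql:
            continue  # edge labels must agree
        trial = dict(mapping)
        ok = True
        for p_vertex, q_vertex in ((ps, qs), (po, qo)):
            if p_vertex in trial:
                ok = trial[p_vertex] == q_vertex
            elif q_vertex in trial.values():
                ok = False  # keep the vertex mapping injective
            else:
                trial[p_vertex] = q_vertex
            if not ok:
                break
        if ok and contains(query, rest, trial):
            return True
    return False


def use(query: List[Triple], pattern: List[Triple]) -> int:
    return 1 if contains(query, pattern) else 0


def acc(workload: List[List[Triple]], pattern: List[Triple]) -> int:
    return sum(use(q, pattern) for q in workload)


def frequent(workload, candidates, min_sup):
    """Keep the candidate patterns whose access frequency reaches min_sup."""
    return [p for p in candidates if acc(workload, p) >= min_sup]
```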

The frequent access patterns can be easily generated by
existing frequent graph mining algorithms. Given a workload of
SPARQL queries $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_q\}$ in a given period, we denote
the set of frequent access patterns that we find as $P = \{p_1, p_2, \ldots, p_x\}$.
In practice, the size of $P$ is often limited. For example, if we set
$minSup$ to 0.1% of the total access frequency, there are only 163
frequent access patterns for DBpedia.

4.1 Frequent Access Pattern Selection 

Generating fragments from all frequent access patterns is
unnecessary and incurs a high space cost. For two similar
frequent access patterns $p$ and $p'$, if they are contained in similar
queries of the workload, then selecting both $p$ and $p'$ for building
fragments provides little more information than selecting
only one of them. Hence, it is often sufficient to select only a
subset of all frequent access patterns to generate fragments.

To select a subset of all frequent access patterns, there are two 
factors that we should consider. 

1. (Hitting the Whole Workload) We should select frequent
access patterns that hit as much of the query workload as possible.
When we select a frequent access pattern
to generate a fragment, all queries isomorphic to this pattern
can be answered directly, which improves efficiency.

2. (Satisfying the Storage Constraint) The total storage of the
system in real applications is limited, so selecting too many
frequent access patterns is not desirable.


The above two factors contradict each other. Hitting the whole
workload requires selecting as many frequent access patterns as
possible, while the storage constraint limits how many
frequent access patterns can be selected. There is thus a tradeoff
between the two factors. In the following, we propose a cost model
that combines these two factors for selecting a set of frequent
access patterns.

4.1.1 Hitting the Whole Workload 

If a fragment is generated from the graph induced by the matches
of a frequent access pattern, then evaluating all queries containing
the pattern can be sped up by using this fragment. The more
queries a frequent access pattern hits, the more we gain
during query processing. Therefore, the benefit of selecting a
frequent access pattern to generate its corresponding fragment should
be defined based on the number of queries that the frequent access
pattern hits.

In addition, if two similar frequent access patterns are contained
in the same set of queries in the workload, it is probably wise to
include only one of them. Generally speaking, among similar
frequent access patterns contained in the same number of queries, it is
often sufficient to materialize only the largest frequent access
pattern. That is to say, if $p'$, a subgraph of $p$, is contained in the same
set of queries as $p$, then $p$ is more beneficial than $p'$ for
building fragments. If we select the larger pattern,
a query is more likely to be decomposed into fewer
subqueries during query processing. Fewer subqueries avoid
some distributed joins, which improves the efficiency of query
processing.

The above observation implies that larger frequent access
patterns are more beneficial for building fragments. This
criterion for the selection of frequent access patterns is
formally defined as the size-increasing benefit.

Definition 8. (Size-increasing Benefit) Given a frequent
access pattern $p$, the benefit of selecting $p$ for hitting the query $Q$,
$Benefit(p, Q)$, is defined as follows.

$$Benefit(p, Q) = |E(p)| \times use(Q, p)$$

Furthermore, a query in the workload may contain multiple
selected frequent access patterns. This means that the query can
be decomposed into multiple sets of subqueries when we evaluate
it. Each set of subqueries maps to an execution plan. Since
only one execution plan is finally selected to evaluate the query, a
query in the workload should contribute to the
benefit of only one particular frequent access pattern. Based
on this observation, we let a query contribute only to the largest
frequent access pattern that it contains.

Definition 9. (Benefit of a Frequent Access Pattern Set) Given
a set of frequent access patterns $P' \subseteq P$, the benefit of selecting
$P'$ over the workload $\mathcal{Q}$ is the sum of the maximum benefit of its
frequent access patterns over $\mathcal{Q}$.

$$Benefit(P', \mathcal{Q}) = \sum_{Q \in \mathcal{Q}} \max_{p \in P'} \{Benefit(p, Q)\}$$
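Definitions 8 and 9 can be illustrated with a few lines of code. The sketch assumes that containment has been precomputed: each pattern is described by its number of edges and the set of identifiers of the queries that contain it. The dictionary layout and the toy numbers are hypothetical.

```python
from typing import Dict, Set, Tuple

# pattern name -> (|E(p)|, ids of the workload queries that contain p)
PatternInfo = Dict[str, Tuple[int, Set[int]]]


def benefit(selected: Set[str], workload: Set[int], info: PatternInfo) -> int:
    """Benefit(P', Q): each query counts only its largest selected pattern."""
    return sum(
        max((info[name][0] for name in selected if q in info[name][1]), default=0)
        for q in workload
    )


# Hypothetical workload of four queries and three candidate patterns.
info = {"p1": (1, {1, 2, 3}), "p2": (2, {1, 2}), "p3": (3, {1})}
workload = {1, 2, 3, 4}
total = benefit({"p1", "p2", "p3"}, workload, info)
```

In this toy instance query 1 contributes 3, query 2 contributes 2, query 3 contributes 1, and query 4 contributes nothing because it contains no selected pattern.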

4.1.2 Satisfying the Storage Constraint 

Furthermore, the total storage of the system in real applications
is limited, so selecting too many frequent access patterns is not
desirable. The selection of frequent access patterns should meet
a storage constraint: once the total size of all fragments reaches
the storage capacity, we cannot select any more frequent
access patterns. We normalize the storage capacity of the system to
a value $SC$. Then, we have the constraint:

$$\sum_{p \in P'} |\llbracket p \rrbracket_G| \times |E(p)| \le SC$$

Here, we assume that $SC$ is larger than the number of edges in
the hot graph, so each hot edge can have at least one copy. This
assumption guarantees the completeness of the RDF graph.

4.1.3 Combining the Two Factors 

Our optimization objective is then to maximize the benefit
subject to the storage constraint. We prove below that the benefit
function (Definition 9) is submodular, so this problem is NP-hard.

Theorem 1. Finding a set of frequent access patterns with the 
largest benefit while subject to the storage constraint is NP-hard. 

Proof. Here, we prove that the benefit function $Benefit(P', \mathcal{Q}) =
\sum_{Q \in \mathcal{Q}} \max_{p \in P'} [|E(p)| \times use(Q, p)]$ is submodular. In other words,
for every $P_1 \subseteq P_2$ and a frequent access pattern $p \notin P_2$, we need to
prove that $\Delta_{Benefit}(p \mid P_1) \ge \Delta_{Benefit}(p \mid P_2)$.

For pattern $p$, we assume that $\mathcal{Q}'$ is the set of queries containing
$p$ in the workload. There are three kinds of queries in $\mathcal{Q}'$: the set $\mathcal{Q}_1$
of queries not containing any patterns in $P_2$, the set $\mathcal{Q}_2$ of queries
containing patterns in $(P_2 - P_1)$, and the set $\mathcal{Q}_3$ of queries only
containing patterns in $P_1$.

Since no query in $\mathcal{Q}_1$ or $\mathcal{Q}_3$ concerns patterns in $(P_2 -
P_1)$, $Benefit(\{p\} \cup P_1, \mathcal{Q}_1 \cup \mathcal{Q}_3) = Benefit(\{p\} \cup P_2, \mathcal{Q}_1 \cup \mathcal{Q}_3)$. Hence,
the marginal gains of $p$ for $P_1$ and $P_2$ over $\mathcal{Q}_1$ and $\mathcal{Q}_3$ are the same.

For $\mathcal{Q}_2$, $\Delta_{Benefit}(p \mid P_1) > \Delta_{Benefit}(p \mid P_2)$ if there exists at least one
query $Q^*$ meeting both of the following conditions: 1) the largest
pattern contained in $Q^*$ over $P_2$ is in $(P_2 - P_1)$ and has a larger size
than $p$; 2) the largest pattern contained in $Q^*$ over $P_1$ has a smaller
size than $p$. These two conditions mean that $p$ can only
increase the benefit of $P_1$ over $\mathcal{Q}_2$ but not the benefit of $P_2$ over $\mathcal{Q}_2$.
Otherwise, for $\mathcal{Q}_2$, $\Delta_{Benefit}(p \mid P_1) = \Delta_{Benefit}(p \mid P_2)$.

In conclusion, $\Delta_{Benefit}(p \mid P_1) \ge \Delta_{Benefit}(p \mid P_2)$ and the function
$Benefit(P', \mathcal{Q})$ is submodular. Since the problem of maximizing
submodular functions is NP-hard, the problem is NP-hard. □

4.1.4 Our Solution 

As proved in Theorem 1, frequent access pattern selection is an
NP-hard problem. We propose a greedy algorithm, outlined in
Algorithm 1. Note that, to guarantee the data integrity of distributed
RDF data fragmentation, each hot edge should be contained in at
least one fragment. Hence, we initialize a one-edge pattern for
each frequent property and compute its corresponding fragment
(Lines 3-6).

After we select all patterns with one edge, we enumerate all
feasible frequent access pattern sets containing one pattern of more
than one edge. Let $P_1$ be a feasible set of cardinality one that
has the largest benefit (Line 7). Then, we iteratively select one
of the remaining frequent access patterns $p'$ to maximize the value
of $Benefit(P' \cup \{p'\}, \mathcal{Q}) - Benefit(P', \mathcal{Q})$ until we meet the storage
constraint or cannot find a frequent access pattern that increases
the benefit (Lines 8-14). Let $P_2$ be the solution obtained in the
iterative phase. Finally, the algorithm outputs $P' \cup P_1$ if
$Benefit(P' \cup P_1, \mathcal{Q}) > Benefit(P' \cup P_2, \mathcal{Q})$, and $P' \cup P_2$ otherwise
(Lines 15-17).
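The selection procedure described above can be sketched as follows. This is an illustrative reading and not a definitive implementation: containment and fragment sizes are assumed to be precomputed per pattern, the greedy step picks the largest marginal benefit without normalizing by fragment size, and a pattern is skipped when its fragment would exceed the remaining capacity. The names `select_patterns`, `info`, and `capacity` are hypothetical.

```python
from typing import Dict, Set, Tuple

# name -> (|E(p)|, ids of queries containing p, size of the fragment built from p)
PatternInfo = Dict[str, Tuple[int, Set[int], int]]


def benefit(selected: Set[str], workload: Set[int], info: PatternInfo) -> int:
    """Each query contributes the size of the largest selected pattern it contains."""
    return sum(
        max((info[n][0] for n in selected if q in info[n][1]), default=0)
        for q in workload
    )


def select_patterns(info: PatternInfo, workload: Set[int], capacity: int) -> Set[str]:
    # Lines 1-6: keep every one-edge pattern so that each hot edge is stored.
    base = {n for n, (edges, _, _) in info.items() if edges == 1}
    used = sum(info[n][2] for n in base)
    rest = sorted(set(info) - base)  # sorted only to make ties deterministic

    # Line 7: the best single multi-edge pattern that still fits.
    fitting = [n for n in rest if used + info[n][2] <= capacity]
    p1 = {max(fitting, key=lambda n: benefit({n}, workload, info))} if fitting else set()

    # Lines 8-14: repeatedly add the pattern with the largest marginal benefit.
    p2: Set[str] = set()
    used2 = 0
    while True:
        current = benefit(base | p2, workload, info)
        best, best_gain = None, 0
        for n in rest:
            if n in p2 or used + used2 + info[n][2] > capacity:
                continue
            gain = benefit(base | p2 | {n}, workload, info) - current
            if gain > best_gain:
                best, best_gain = n, gain
        if best is None:
            break  # nothing fits or nothing increases the benefit
        p2.add(best)
        used2 += info[best][2]

    # Lines 15-17: return the better of the two candidate solutions.
    if benefit(base | p1, workload, info) > benefit(base | p2, workload, info):
        return base | p1
    return base | p2
```

Recomputing the benefit from scratch for every candidate keeps the sketch short; an incremental update of the per-query maximum would avoid the repeated passes over the workload.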

Theorem 2. Algorithm 1 obtains a set of frequent access
patterns of benefit at least $\min\{\frac{1}{\max_{p \in P} |E(p)|}, \frac{1}{2}(1 - \frac{1}{e})\}$ times the value
of an optimal solution.

Proof. There are two parts in Algorithm 1: initialization and
greedy selection of frequent access patterns.
Algorithm 1: Frequent Access Pattern Selection Algorithm

Input: A set of frequent access patterns P = {p_1, p_2, ..., p_x}
Output: A set P' ⊆ P to generate fragments
 1  P' ← ∅;
 2  TotalSize ← 0;
 3  for each p ∈ P and p has only one edge do
 4      P' ← P' ∪ {p};
 5      P ← P - {p};
 6      TotalSize ← TotalSize + |E(⟦p⟧_G)|;
 7  P_1 ← argmax{Benefit({p_i}, Q) : p_i ∈ P, |E(⟦p_i⟧_G)| + TotalSize ≤ SC ∧ |E(p_i)| > 1};
 8  P_2 ← ∅;
 9  TotalSize' ← 0;
10  while TotalSize' < SC - TotalSize do
11      Find the frequent access pattern p' ∈ P - P' with the largest additional benefit;
12      P_2 ← P_2 ∪ {p'};
13      P ← P - {p'};
14      TotalSize' ← TotalSize' + |E(⟦p'⟧_G)|;
15  if Benefit(P' ∪ P_1, Q) > Benefit(P' ∪ P_2, Q) then
16      Return P' ∪ P_1;
17  Return P' ∪ P_2;


For the initialization (Lines 3-6 in Algorithm 1), all selected patterns
contain only one edge, so $|E(p)| = 1$. Therefore, the benefit of the
one-edge patterns of frequent properties is $\sum_{Q \in \mathcal{Q}} \max_{p \in P'} \{1 \times
use(Q, p)\}$. Since the hot edges hit almost all queries in the
workload, this sum is approximately equal to the size
of the workload, $|\mathcal{Q}|$. On the other hand, in the worst case, the
optimal solution is that all queries in the workload contain the largest
frequent access pattern. Then, the benefit of the optimal solution
is $\sum_{Q \in \mathcal{Q}} \{|E(p_{max})| \times use(Q, p_{max})\}$, where $p_{max}$ is the frequent pattern
with the largest size. Hence, the benefit of the selected patterns in
the initial phase is at least $\frac{1}{\max_{p \in P} |E(p)|}$ of the optimal benefit.

For the phase of greedily selecting frequent access patterns (Lines
7-14 in Algorithm 1), the problem of selecting the optimal set
of frequent access patterns is a problem of maximizing a
submodular set function subject to a knapsack constraint, as discussed in
Theorem 1, so we directly apply the existing greedy algorithm for this
problem to iteratively select frequent access patterns. The
worst-case performance guarantee of that greedy algorithm has been
proved to be $\frac{1}{2}(1 - \frac{1}{e})$, so
the benefit of the selected patterns in this phase is at least $\frac{1}{2}(1 - \frac{1}{e})$
of the optimal benefit.

In summary, the final performance guarantee of our algorithm is
$\min\{\frac{1}{\max_{p \in P} |E(p)|}, \frac{1}{2}(1 - \frac{1}{e})\}$. □

5. FRAGMENTATION 

In this section, we present two fragmentation strategies: vertical 
and horizontal. 

5.1 Vertical Fragmentation 

For vertical fragmentation, we put matches homomorphic to the
same frequent access pattern into the same fragment. Because a
query graph often contains only a few frequent access patterns and
the matches of one frequent access pattern are stored together,
irrelevant fragments can be filtered out during query evaluation, and
only the sites storing relevant fragments need to be accessed to find
matches. Filtering out irrelevant fragments improves the query
performance. Furthermore, sites not storing relevant fragments can
be used to evaluate other queries in parallel, which improves the
total throughput of the system. In summary, the vertical
fragmentation strategy uses the locality of SPARQL queries to improve
both query response time and throughput. Our experimental results
also confirm this argument.

A frequent access pattern $p$ can be transformed
into a SPARQL query and evaluated over the RDF graph. We use the
results $\llbracket p \rrbracket_G$ of this selection operation based on
$p$ to generate a vertical fragment. All vertical fragments generated
from our selected frequent access patterns form a vertical
fragmentation. Given a set of frequent access patterns $P$, we formally
define its corresponding vertical fragmentation over an RDF graph
$G$ as follows.

Definition 10. (Vertical Fragmentation) Given an RDF graph
$G$ and a frequent access pattern $p$, a vertical fragment $F$ generated
from $p$ is defined as $F = \{V(F), E(F), L''\}$, where (1) $V(F) \subseteq V(G)$
is the set of vertices occurring in $\llbracket p \rrbracket_G$; (2) $E(F) \subseteq E(G)$ is the set
of edges occurring in $\llbracket p \rrbracket_G$; and (3) $L'' \subseteq L$ is the set of edge labels
occurring in $\llbracket p \rrbracket_G$.

Then, given a set of frequent access patterns $P = \{p_1, p_2, \ldots, p_x\}$,
the corresponding vertical fragmentation is $\mathcal{F} = \{F_i \mid 1 \le i \le x$ and
$F_i$ is the vertical fragment generated from $p_i\}$.
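A vertical fragment per Definition 10 is the union of the edges touched by the matches of one pattern. The sketch below assumes the graph and the pattern are lists of (subject, property, object) triples with `?`-prefixed variables, and uses a naive backtracking matcher; the function names and the toy data are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def find_matches(pattern: List[Triple], graph: List[Triple],
                 binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield each variable binding that maps every pattern edge onto a graph edge."""
    binding = binding or {}
    if not pattern:
        yield dict(binding)
        return
    first, rest = pattern[0], pattern[1:]
    for edge in graph:
        extended = dict(binding)
        if all(extended.setdefault(t, g) == g if t.startswith("?") else t == g
               for t, g in zip(first, edge)):
            yield from find_matches(rest, graph, extended)


def vertical_fragment(pattern: List[Triple], graph: List[Triple]) -> Set[Triple]:
    """E(F): every graph edge that occurs in at least one match of the pattern."""
    fragment: Set[Triple] = set()
    for b in find_matches(pattern, graph):
        for s, p, o in pattern:
            fragment.add((b.get(s, s), b.get(p, p), b.get(o, o)))
    return fragment


# Hypothetical toy graph and pattern, not those of Figures 1 and 4.
G = [
    ("Aristotle", "influencedBy", "Plato"),
    ("Plato", "influencedBy", "Socrates"),
    ("Plato", "mainInterest", "Ethics"),
]
p = [("?x", "influencedBy", "?y"), ("?y", "mainInterest", "?z")]
F = vertical_fragment(p, G)
```

The vertex set and label set of the fragment follow from the returned edges, so only $E(F)$ is materialized here.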

Example 1. Given a frequent access pattern in Figure 4,
Figure 5 shows the corresponding vertical fragment.

Figure 5: Example Vertical Fragment

5.2 Horizontal Fragmentation 

For horizontal fragmentation, we put the matches of one frequent
access pattern into different fragments and distribute them among
different sites. A query may then involve many fragments, each
holding a few matches. A fragment is often
much smaller than the whole data, so finding matches of
a query over a fragment explores a smaller search space than finding
matches over the whole data. If the fragments involved in a query
are allocated to different sites, then each site finds a few matches
over fragments that are smaller than the whole data. This
strategy uses the parallelism of the cluster of sites to reduce the
query response time. Our experimental results also confirm this
argument.

In this section, we extend the concepts of simple predicate and
minterm predicate, originally developed for relational systems,
to divide the RDF graph horizontally.

5.2.1 Structural Minterm Predicate

First, we define the structural simple predicate. Each structural
simple predicate corresponds to a frequent access pattern with a
single (in)equality. Given a frequent access pattern $p$ with variable
set $\{var_1, var_2, \ldots, var_n\}$, a structural simple predicate $sp$ defined on
$p$ has the following form.

$$sp : p(var_i) \; \theta \; Value$$

where $\theta \in \{=, \ne\}$ and $Value$ is a constant constraint for $var_i$ chosen
from a query containing $p$ in $\mathcal{Q}$.


Example 2. Let us consider the query graph $Q_1$ in Figure 1
and its corresponding frequent access pattern $p_3$.
We can generate four structural simple predicates: (1) $sp_1 :
p_3(?x_1) = Aristotle$; (2) $sp_2 : p_3(?x_1) \neq Aristotle$; (3) $sp_3 :
p_3(?x_2) = Ethics$; (4) $sp_4 : p_3(?x_2) \neq Ethics$.


Then, we define the structural minterm predicate as the conjunction
of structural simple predicates of the same frequent access pattern.
We can obtain all structural minterm predicates by enumerating
all possible combinations of structural simple predicates. Given
a set of structural simple predicates $SP = \{sp_1, sp_2, \ldots, sp_y\}$ for
frequent access pattern $p$, the set of structural minterm predicates
$M = \{mp_1, mp_2, \ldots, mp_z\}$ for $p$ is defined as follows.

$M = \{ mp_i \mid mp_i = \bigwedge_{sp_k \in SP} sp_k^{*},\ 1 \le k \le y \}$

where $sp_k^{*} = sp_k$ or $sp_k^{*} = \neg sp_k$. So each structural simple
predicate can occur in a structural minterm predicate either in its
natural form or its negated form.
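
The enumeration described above can be illustrated with a minimal Python sketch. It assumes each simple predicate is reduced to a (variable, constant) pair that may appear in natural ("=") or negated ("!=") form; the function name `enumerate_minterms`, the tuple encoding, and the variable names are illustrative choices, not part of the original method.

```python
from itertools import product


def enumerate_minterms(base_predicates):
    """Return all conjunctions in which every base predicate appears
    either in natural form ("=") or in negated form ("!=").

    base_predicates: list of (variable, constant) pairs (hypothetical encoding).
    """
    minterms = []
    for signs in product((True, False), repeat=len(base_predicates)):
        minterm = tuple(
            (var, "=" if natural else "!=", value)
            for (var, value), natural in zip(base_predicates, signs)
        )
        minterms.append(minterm)
    return minterms


# Two equality constraints borrowed from the running example;
# the variable names are placeholders.
base = [("?x0", "Aristotle"), ("?x1", "Ethics")]
minterms = enumerate_minterms(base)
for mp in minterms:
    print(" AND ".join(f"p3({v}) {op} {c}" for v, op, c in mp))
```

With n equality constraints the sketch produces 2^n minterm predicates, which is why pruning by access frequency matters in practice.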

Similar to the frequent access pattern, we can also define the 
structural minterm predicate usage value and access frequency to 
record the access frequency of a structural minterm predicate. We 
can prune the minterm predicates with small access frequencies. 


Definition 11. (Structural Minterm Predicate Usage Value)
Given a SPARQL query $Q$ and a structural minterm predicate $mp$,
we associate a structural minterm predicate usage value, denoted
as $use(Q, mp)$, and defined as follows:

$use(Q, mp) = \begin{cases} 1 & \text{if predicate } mp \text{ is a subgraph of } Q \\ 0 & \text{otherwise} \end{cases}$

Then, given a set of SPARQL queries $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_q\}$, we
define the access frequency of a structural minterm predicate $mp$ as
follows.

$acc(mp) = \sum_{k=1}^{q} use(Q_k, mp)$

In practice, there may exist many minterm predicates. It is too 
expensive to enumerate all minterm predicates. Therefore, we prune 
some minterm predicates with too small access frequencies. 
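
As a rough illustration of the usage value, the access frequency, and the pruning step, the sketch below makes a strong simplifying assumption: every query in the workload already contains the frequent access pattern, and is represented only by the constants it binds to the pattern's variables. Under that assumption, "mp is a subgraph of Q" reduces to checking the (in)equalities. The names `use`, `access_frequency`, `prune`, and the sample workload are hypothetical.

```python
def use(query_bindings, minterm):
    """1 if the query's constants satisfy every literal of the minterm, else 0.

    query_bindings: dict variable -> constant (unbound variables are absent).
    minterm: tuple of (variable, op, constant) with op in {"=", "!="}.
    """
    for var, op, value in minterm:
        bound = query_bindings.get(var)
        if op == "=" and bound != value:
            return 0
        if op == "!=" and bound == value:
            return 0
    return 1


def access_frequency(workload, minterm):
    """acc(mp): number of workload queries that use the minterm."""
    return sum(use(q, minterm) for q in workload)


def prune(workload, minterms, min_freq):
    """Keep only minterm predicates accessed at least min_freq times."""
    return [mp for mp in minterms if access_frequency(workload, mp) >= min_freq]


mp1 = (("?x0", "=", "Aristotle"), ("?x1", "=", "Ethics"))
mp2 = (("?x0", "=", "Aristotle"), ("?x1", "!=", "Ethics"))
mp3 = (("?x0", "!=", "Aristotle"), ("?x1", "=", "Ethics"))
mp4 = (("?x0", "!=", "Aristotle"), ("?x1", "!=", "Ethics"))

# Made-up workload: each dict lists the constants one query binds.
workload = [
    {"?x0": "Aristotle", "?x1": "Ethics"},
    {"?x0": "Aristotle"},
    {"?x0": "Aristotle", "?x1": "Politics"},
    {"?x0": "Plato"},
]
kept = prune(workload, [mp1, mp2, mp3, mp4], min_freq=2)
```

A real implementation would need a subgraph test between the constrained pattern and the query graph; the dictionary comparison here only stands in for it.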

Given a structural minterm predicate $mp$, it can then be transformed
into SPARQL queries, resulting in a horizontal fragment
of the RDF graph. We use the results $[[mp]]_G$ of a selection operation
based on $mp$ to generate a horizontal fragment. All horizontal
fragments generated from the structural minterm predicates that we
obtain construct a horizontal fragmentation. Given a set of minterm
predicates $M$, we formally define its corresponding horizontal
fragmentation over an RDF graph $G$ as follows.

Definition 12. (Horizontal Fragmentation) Given an RDF
graph $G$ and a structural minterm predicate $mp$, a horizontal
fragment $F$ generated from $mp$ is defined as $F = \{V(F), E(F), L'\}$,
where (1) $V(F) \subseteq V(G)$ is the set of vertices occurring in $[[mp]]_G$;
(2) $E(F) \subseteq E(G)$ is the set of edges occurring in $[[mp]]_G$; and (3)
$L' \subseteq L$ is the set of edge labels occurring in $[[mp]]_G$.

Then, given a set of structural minterm predicates $M = \{mp_1, mp_2,
\ldots, mp_y\}$, the corresponding horizontal fragmentation is
$\mathcal{F} = \{F_i \mid 1 \le i \le y$ and $F_i$ is the horizontal fragment
generated from $mp_i\}$.

Example 3. Given the structural simple predicates in Example 2,
we can get all structural minterm predicates from frequent
access pattern $p_3$ as follows: (1) $mp_1 : p_3(?x_0) = Aristotle \wedge
p_3(?x_1) = Ethics$; (2) $mp_2 : p_3(?x_0) = Aristotle \wedge p_3(?x_1) \neq
Ethics$; (3) $mp_3 : p_3(?x_0) \neq Aristotle \wedge p_3(?x_1) = Ethics$; (4)
$mp_4 : p_3(?x_0) \neq Aristotle \wedge p_3(?x_1) \neq Ethics$.

Figure 6 shows all horizontal fragments generated from the above
structural minterm predicates.

(a) Example Horizontal Fragment Generated from $mp_1$
(b) Example Horizontal Fragment Generated from $mp_2$

Figure 6: Example Horizontal Fragments

6. ALLOCATION

After fragmenting the RDF graph, the next step is to allocate all
fragments to several sites. In real applications, some frequent
access patterns or structural minterm predicates are usually accessed
together, so their corresponding fragments should be placed at one
site to avoid cross-fragment joins. We therefore need a measure that
precisely evaluates this notion of "togetherness". This measure is
the affinity of fragments, which indicates how closely related the
fragments are.

We define the fragment affinity metric to measure the togetherness
between two fragments generated from frequent access patterns or
structural minterm predicates as follows:

Definition 13. (Fragment Affinity Metric) The fragment affinity
metric between two fragments $F$ and $F'$ with respect to the
workload $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_q\}$ is defined as follows:

• $aff(F, F') = \sum_{k=1}^{q} use(Q_k, p) \times use(Q_k, p')$, if $F$ and $F'$ are
vertical fragments generated from frequent access patterns $p$
and $p'$;

• $aff(F, F') = \sum_{k=1}^{q} use(Q_k, mp) \times use(Q_k, mp')$, if $F$ and $F'$
are horizontal fragments generated from structural minterm
predicates $mp$ and $mp'$.

The fragment affinity metric shows how closely related two fragments
are. If the affinity metric of two fragments is large, these two
fragments are often involved in the same query. Some fragments are
so closely related that they should be placed together to reduce the
number of cross-site joins. Here, we group all fragments into
clusters. The result of clustering corresponds to an allocation, and
each cluster corresponds to an element of $\mathcal{A}$, which means that all
fragments in the cluster are placed at the same site.

There are many clustering algorithms that could cluster the fragments,
and we need to select one of them. In this paper, we extend a graph
clustering algorithm, PNN, to cluster all fragments into an
allocation $\mathcal{A} = \{A_1, A_2, \ldots, A_m\}$. All fragments of the same cluster
are put at one site.

First, we build the allocation graph as follows.

Definition 14. (Allocation Graph) Given a fragmentation $\mathcal{F} =
\{F_1, F_2, \ldots\}$, the corresponding allocation graph $AG = \{V(AG),
E(AG), fw\}$ is defined as follows:

• $V(AG)$ is a set of vertices that map to all fragments;

• $E(AG)$ is a set of undirected edges such that $vv' \in E(AG)$ if and
only if the fragment affinity metric between the corresponding
fragments of $v$ and $v'$ is larger than 0;

• $fw$ is a weight function $fw : E(AG) \to \mathbb{N}^{+}$. If $v$ and $v'$
correspond to fragments $F$ and $F'$, $fw(vv') = aff(F, F')$.

Then, the allocation problem is equivalent to clustering all fragments
into $m$ clusters such that all fragments in a cluster are connected
in $AG$. We define the density of a cluster $A_i$ in $AG$ to rate the
quality of $A_i$ as follows.

$density(A_i) = \dfrac{\sum_{v_i \in A_i \wedge v_j \in A_i \wedge v_i v_j \in E(AG)} fw(v_i v_j)}{\binom{|A_i|}{2}}$

where $\sum_{v_i \in A_i \wedge v_j \in A_i \wedge v_i v_j \in E(AG)} fw(v_i v_j)$ is the sum of weights
of all edges in $A_i$ and $\binom{|A_i|}{2}$ is the maximum possible number of edges.
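
The fragment affinity metric, the allocation graph, and the cluster density defined above can be sketched in a few lines of Python. The sketch assumes that the usage values of each fragment's pattern (or minterm predicate) over the workload are already available as 0/1 lists; the fragment names and the `usage` table are made up for illustration.

```python
from itertools import combinations


def affinity(usage, a, b):
    """aff(F, F'): number of workload queries that use both fragments."""
    return sum(ua * ub for ua, ub in zip(usage[a], usage[b]))


def build_allocation_graph(usage):
    """Return {frozenset({a, b}): weight} for every pair with affinity > 0."""
    edges = {}
    for a, b in combinations(sorted(usage), 2):
        weight = affinity(usage, a, b)
        if weight > 0:
            edges[frozenset((a, b))] = weight
    return edges


def density(cluster, edges):
    """Sum of internal edge weights divided by C(|cluster|, 2)."""
    n = len(cluster)
    if n < 2:
        return 0.0
    total = sum(
        edges.get(frozenset(pair), 0)
        for pair in combinations(sorted(cluster), 2)
    )
    return total / (n * (n - 1) / 2)


# Hypothetical usage values: one 0/1 entry per workload query.
usage = {
    "F1": [1, 1, 0, 1],
    "F2": [1, 1, 0, 0],
    "F3": [0, 0, 1, 1],
}
edges = build_allocation_graph(usage)
```

Pairs with zero affinity get no edge, so fragments that are never queried together can only end up in the same cluster through a chain of related fragments.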

The objective of our allocation algorithm is to search for $m$
subgraphs of $AG$ that have the highest densities. Unfortunately, this
problem is NP-complete, so we propose a heuristic solution as
Algorithm 2. Algorithm 2 is a variant of PNN and picks the locally
optimal choice of merging two vertices in $AG$ at each step. Because
our objective function guarantees that the locally optimal choice is
also the optimal choice for the overall solution, Algorithm 2 can
find the optimal clustering result of $AG$.

Generally speaking, we initialize a cluster for each fragment.
Then, we repeatedly pick the two clusters (singletons or larger)
that have the highest weight value and merge them. The weight
between two clusters is the density value of merging them. Such
merging is iterated until the size of the allocation graph has been
reduced to $m$.
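
The greedy merging loop can be illustrated as follows. This is a simplified sketch rather than the paper's Algorithm 2: it recomputes the density of every candidate merge from scratch instead of maintaining edge weights incrementally, it only merges clusters that share at least one edge, and the names `allocate` and `density` as well as the sample graph are assumptions.

```python
from itertools import combinations


def density(cluster, edges):
    """Sum of internal edge weights divided by C(|cluster|, 2)."""
    n = len(cluster)
    if n < 2:
        return 0.0
    total = sum(
        edges.get(frozenset(pair), 0)
        for pair in combinations(sorted(cluster), 2)
    )
    return total / (n * (n - 1) / 2)


def allocate(vertices, edges, m):
    """Greedily merge the connected pair of clusters whose union is densest,
    until only m clusters remain (or no connected pair is left)."""
    clusters = [frozenset([v]) for v in vertices]
    while len(clusters) > m:
        best = None
        for a, b in combinations(clusters, 2):
            connected = any(frozenset((u, v)) in edges for u in a for v in b)
            if not connected:
                continue
            score = density(a | b, edges)
            if best is None or score > best[0]:
                best = (score, a, b)
        if best is None:
            break  # remaining clusters are mutually disconnected
        _, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Hypothetical allocation graph with four fragments.
edges = {
    frozenset(("F1", "F2")): 5,
    frozenset(("F3", "F4")): 4,
    frozenset(("F2", "F3")): 1,
}
allocation = allocate(["F1", "F2", "F3", "F4"], edges, m=2)
```

In this toy graph the two heavy edges are merged first, so the weakly related pair F2 and F3 ends up on different sites.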


Algorithm 2: Allocation Algorithm

Input: The allocation graph $AG$ and the preset threshold $\theta$
Output: An allocation $\mathcal{A} = \{A_1, A_2, \ldots, A_m\}$
for each vertex $v_i$ in $V(AG)$ do
    $A_i \leftarrow \{v_i\}$;
Find the edge $e_{max}$ with the highest weight in $E(AG)$;
Initialize $AG'$ that is the same as $AG$;
while $|V(AG')| \neq m$ do
    Generate $AG'$ from $AG$ by merging $e_{max} = A_i A_j$ into $A_{ij}$;
    for each $A_k$ adjacent to $A_{ij}$ in $E(AG')$ do
        $fw(A_k A_{ij}) \leftarrow \sum_{v_i \in A_k \wedge (v_j \in A_i \vee v_j \in A_j) \wedge v_i v_j \in E(AG)} fw(v_i v_j)$;
    Find the edge $e_{max}$ with the highest weight in $E(AG')$;


7. DISTRIBUTED QUERY PROCESSING 

In this section, we discuss how to process a SPARQL query. Query
processing requires metadata, and we introduce how to maintain the
metadata in a data dictionary in Section 7.1. Then, we discuss how
to decompose a query into subqueries in Section 7.2. Last, we
discuss how to produce a distributed execution plan and execute all
subqueries based on the plan in Section 7.3.

Figure 7: A New Input Query and Its Example Valid Decompositions


7.1 Data Dictionary 

After fragmentation and allocation, the results of fragmentation 
and allocation need to be stored and maintained by the system. 
This information is necessary during distributed query processing. 
This information is stored in a data dictionary. The data dictio¬ 
nary stores a global statistics file generated at fragmentation and 
allocation time. It contains the following information: fragment 
definitions, their sizes, site mappings, access frequencies and so 
on. 

Since each fragment corresponds to a frequent access pattern or a
structural minterm predicate, the data dictionary uses the frequent
access pattern with/without constraints as the representative of a
fragment. Each frequent access pattern with/without constraints
corresponds to a fragment and is associated with all statistics of
the fragment. The data dictionary needs to quickly retrieve all
frequent access patterns with/without constraints to determine the
relevant frequent access patterns for a query.

We build a hash table to achieve the above objective. We first
use DFS coding to translate frequent access patterns into
sequences. With the DFS code of a frequent access pattern, we can
map any frequent access pattern to an integer by hashing its
canonical label. Then, we use the hash table to locate frequent access
patterns and retrieve the statistics of their corresponding fragments.
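
To make the lookup concrete, here is a small Python sketch of a dictionary keyed by a canonical label. It does not implement DFS coding; as a stand-in it computes a brute-force canonical form (the smallest edge list over all vertex renamings), which is only practical for the small patterns considered here. The class name `DataDictionary`, the statistics fields, and the property names are hypothetical.

```python
from itertools import permutations


def canonical_label(edges):
    """Smallest relabelled edge list over all vertex renamings.

    edges: list of (subject, property, object) triples. Two isomorphic
    patterns get the same label regardless of how their vertices are named.
    """
    vertices = sorted({v for s, _, o in edges for v in (s, o)})
    best = None
    for perm in permutations(range(len(vertices))):
        rename = dict(zip(vertices, perm))
        code = tuple(sorted((rename[s], prop, rename[o]) for s, prop, o in edges))
        if best is None or code < best:
            best = code
    return best


class DataDictionary:
    """Hash table from canonical pattern label to fragment statistics."""

    def __init__(self):
        self._table = {}  # Python dicts hash the tuple key internally

    def register(self, pattern_edges, stats):
        self._table[canonical_label(pattern_edges)] = stats

    def lookup(self, pattern_edges):
        return self._table.get(canonical_label(pattern_edges))


dictionary = DataDictionary()
dictionary.register(
    [("?a", "influencedBy", "?b"), ("?a", "mainInterest", "?c")],
    {"site": 2, "size": 1200},  # made-up statistics
)
# Same pattern with different variable names and edge order.
found = dictionary.lookup(
    [("?y", "mainInterest", "?z"), ("?y", "influencedBy", "?x")]
)
missing = dictionary.lookup([("?a", "influencedBy", "?b")])
```

The brute-force label costs factorial time in the number of vertices; a DFS-code based canonical label, as the text describes, avoids that blow-up.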

7.2 Query Decomposition 

When a user inputs a query $Q$, the system first uses the data
dictionary to determine which fragments are involved in the query and
decomposes the query into subqueries on fragments.

Given a query $Q$, a decomposition of $Q$ is a set of subqueries
$\mathcal{D} = \{q_1, \ldots, q_t\}$ such that (1) each $q_i$ is a subgraph of $Q$ and $q_i$
maps to a frequent access pattern or structural minterm predicate;
(2) $V(q_1) \cup \ldots \cup V(q_t) = V(Q)$; and (3) $E(q_1) \cup \ldots \cup E(q_t) = E(Q)$
and $\forall i \neq j$, $E(q_i) \cap E(q_j) = \emptyset$.

Since we partition the RDF graph based on the frequent access
patterns, we also decompose the query based on the frequent access
patterns. In other words, we decompose the query into subqueries
that are homomorphic to frequent access patterns. If a query
involves infrequent properties that cannot be decomposed into
subqueries homomorphic to any frequent access patterns, then each
connected subgraph of the query that only contains infrequent
properties corresponds to a subquery. We define the valid
decomposition as follows.

Definition 15. (Valid Decomposition) Given a SPARQL query
$Q$, a valid decomposition $\mathcal{D} = \{q_1, q_2, \ldots, q_t\}$ of $Q$ should meet the
following constraint: if $q_i$ ($1 \le i \le t$) is not homomorphic to any
frequent access pattern, all edges in $q_i$ should be cold edges.

There exists at least one valid decomposition: the decomposition
in which every subquery is a single edge. Because we select all
frequent access patterns of one edge, this single-edge decomposition
is valid. Besides this decomposition, there may also exist other
valid decompositions. Hence, we propose a cost-model driven
selection, and the best valid decomposition is the valid
decomposition with the smallest cost.

Here, we assume that the cost of a decomposition is the cost of
joining all matches of the subqueries in $\mathcal{D}$ and that each pair of
subqueries' matches can join together. This assumption is the worst
case, so that we can quantify the worst-case performance. Then,
we define the cost of a decomposition as follows.

$cost(\mathcal{D}) = \prod_{q_i \in \mathcal{D}} card(q_i)$

where $card(q_i)$ is the number of matches for $q_i$, which can be
estimated by looking up the data dictionary.

Example 4. Assume that a user inputs a new query as
shown in Figure 7(a). Given the selected frequent access patterns,
there can be two valid decompositions $\mathcal{D}_1$ and $\mathcal{D}_2$, as shown in
Figures 7(b) and 7(c). For vertical fragmentation, $q_{23}$ in $\mathcal{D}_2$ is
evaluated on the vertical fragment of $p_3$ (Figure 5); for horizontal
fragmentation, $q_{23}$ is evaluated on the horizontal fragment of $mp_2$
(Figure 6(b)).

Whether in vertical or in horizontal fragmentation, $\mathcal{D}_2$ has fewer
subqueries than $\mathcal{D}_1$, and $card(q_{23})$ is smaller than the product of
the cardinalities of three subqueries in $\mathcal{D}_1$. Hence, $cost(\mathcal{D}_2)$ is
smaller than $cost(\mathcal{D}_1)$, and $\mathcal{D}_2$ is preferred as the final
decomposition.

Based on the above definitions, we propose the query decomposition
algorithm as Algorithm 3. Because the SPARQL query
graphs in real applications usually contain 10 or fewer edges, we
can use a brute-force implementation to enumerate all possible
decompositions and find the decomposition with the smallest cost.
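
A brute-force search of this kind can be sketched in Python as follows. The sketch enumerates every partition of the query's edges and keeps the cheapest one whose blocks are all known subqueries. It assumes a hypothetical lookup table `card` that maps an edge set (standing in for a subquery that matches a frequent access pattern or a cold subgraph) to its estimated cardinality; connectivity checks and pattern matching are omitted.

```python
from math import prod


def partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into a block of its own ...
        yield [[first]] + part
        # ... or add it to each existing block in turn.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def best_decomposition(query_edges, card):
    """Return (blocks, cost) of the cheapest decomposition whose blocks
    all appear in `card`; cost is the product of block cardinalities."""
    best, best_cost = None, float("inf")
    for part in partitions(list(query_edges)):
        blocks = [frozenset(block) for block in part]
        if not all(block in card for block in blocks):
            continue  # some block is not a known subquery
        cost = prod(card[block] for block in blocks)
        if cost < best_cost:
            best, best_cost = blocks, cost
    return best, best_cost


# Hypothetical cardinality estimates from the data dictionary.
card = {
    frozenset(["e1"]): 100,
    frozenset(["e2"]): 50,
    frozenset(["e3"]): 20,
    frozenset(["e2", "e3"]): 30,
}
blocks, cost = best_decomposition(["e1", "e2", "e3"], card)
```

The number of partitions grows as the Bell numbers (115,975 for 10 edges), which is consistent with the remark that brute force is acceptable for queries with about 10 or fewer edges.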

7.3 Query Optimization and Execution 

After decomposing the query, the next step is to find an execution
plan for the query which is close to optimal. In this section, we
discuss the major optimization issue of finding an execution plan,
which is the join ordering of subqueries. We extend the algorithm
of System-R to find the optimal execution plan for distributed
SPARQL queries. The algorithm is described in Algorithm 4.

Generally speaking, Algorithm 4 is a variant of the System-R style
dynamic programming algorithm. It first generates the best execution
plan of $n-1$ subqueries, and then joins the matches of the $n-1$
subqueries with the matches of the $n$-th subquery. The cost of an
execution plan can also be estimated based on the number of
subqueries' results, which is stored in the data dictionary.
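
The dynamic programming idea can be sketched as follows. The cost model is an assumption made only for illustration: the size of a join is estimated as the product of the input sizes times a fixed selectivity, and the cost of a plan is the sum of its intermediate result sizes. The function name `best_left_deep_plan`, the `selectivity` parameter, and the sample cardinalities are hypothetical.

```python
from itertools import combinations


def best_left_deep_plan(card, selectivity=0.1):
    """System-R style DP over left-deep join orders.

    card: dict subquery name -> estimated number of matches.
    Returns (cost, estimated_size, join_order) of the cheapest plan.
    """
    names = sorted(card)
    if len(names) < 2:
        raise ValueError("need at least two subqueries to join")
    # Level 2: every pair of subqueries.
    table = {}
    for a, b in combinations(names, 2):
        size = card[a] * card[b] * selectivity
        table[frozenset((a, b))] = (size, size, (a, b))
    # Levels 3..t: extend each kept plan by one more subquery and keep
    # only the cheapest plan per set of joined subqueries.
    for _ in range(3, len(names) + 1):
        next_table = {}
        for joined, (cost, size, order) in table.items():
            for q in names:
                if q in joined:
                    continue
                new_size = size * card[q] * selectivity
                new_cost = cost + new_size
                key = joined | {q}
                if key not in next_table or new_cost < next_table[key][0]:
                    next_table[key] = (new_cost, new_size, order + (q,))
        table = next_table
    return min(table.values(), key=lambda entry: entry[0])


plan_cost, plan_size, plan_order = best_left_deep_plan(
    {"q1": 1000, "q2": 10, "q3": 100}
)
```

Keeping one plan per set of subqueries is what prevents the search from enumerating all join permutations: plans covering the same subqueries are interchangeable for every later join.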

Finally, each subquery is executed at the corresponding sites in
parallel. The optimization of each subquery uses the existing
methods in centralized RDF database systems. After the matches of all
subqueries are generated, we join them together according to the
optimal execution plan.


Algorithm 3: Query Decomposition Algorithm

Input: A query $Q$
Output: A valid decomposition $\mathcal{D} = \{q_1, q_2, \ldots, q_t\}$ of query $Q$
MinCost $\leftarrow +\infty$;
Initialize $\mathcal{D}$ as the decomposition of all subqueries of a single edge;
for each possible valid decomposition $\mathcal{D}' = \{q_1, \ldots, q_{t'}\}$ do
    CurrentCost $\leftarrow 1$;
    for each query $q_i$ in $\mathcal{D}'$ do
        Estimate the number of results for $q_i$ as $card(q_i)$ based on the
        data dictionary;
        CurrentCost $\leftarrow$ CurrentCost $\times\ card(q_i)$;
    if MinCost > CurrentCost then
        MinCost $\leftarrow$ CurrentCost;
        $\mathcal{D} \leftarrow \mathcal{D}'$;
Return $\mathcal{D}$;


Algorithm 4: Query Optimization Algorithm

Input: A decomposition $\mathcal{D} = \{q_1, q_2, \ldots, q_t\}$ of query $Q$
Output: An execution plan $(\ldots((q_{i_1} \bowtie q_{i_2}) \bowtie q_{i_3}) \bowtie \ldots \bowtie q_{i_t})$
for each two subqueries $q_i$ and $q_j$ where $1 \le i < j \le t$ do
    Initialize an execution plan $q_i \bowtie q_j$ and estimate its cost;
    Store all execution plans and their costs in a table $T_2$;
for $l = 3$ to $t$ do
    for each execution plan $pl_j$ in $T_{l-1}$ do
        for each subquery $q_k$ that is not contained by $pl_j$ do
            Build execution plan $pl_j \bowtie q_k$ and estimate its cost;
            Store this execution plan and its cost in a table $T_l$;
    for each two plans $pl_j$ and $pl_k$ in $T_l$ do
        if $pl_j$ and $pl_k$ map to the same set of subqueries then
            Eliminate the one of $pl_j$ and $pl_k$ that has the larger cost;
Return the execution plan with the minimum cost;


8. EXPERIMENTAL EVALUATION 

We conducted extensive experiments to test the effectiveness of 
our proposed techniques on a real dataset, DBPedia, and a synthetic 
dataset, WatDiv. In this section, we report the setting of test data 
and various performance results. 


8.1 Setting

DBPedia. DBPedia^1 is an RDF dataset extracted from Wikipedia.
DBPedia contains 163,977,110 triples. We use the DBpedia
SPARQL query log as the workload. This workload contains
queries posed to the official DBpedia SPARQL endpoint over 14 days
in 2012. After removing some queries that cannot be handled, there
are 8,151,238 queries in the workload.

WatDiv. WatDiv [1] is a benchmark that enables diversified stress
testing of RDF data management systems. In WatDiv, instances of
the same type can have different sets of attributes. For testing
our methods, we generate five datasets varying in size from 50
million to 250 million triples. By default, we use the RDF dataset
with 100 million triples. In addition, WatDiv can generate a workload
by instantiating templates with actual RDF terms from the
dataset. WatDiv provides 20 templates to generate test queries. We
use these benchmark templates to generate a workload with 2000
test queries.

We conduct all experiments on a cluster of 10 machines running
Linux, each of which has one CPU with four 3.06GHz cores.
Each site has 16GB of memory and 150GB of disk storage. We select
one of these sites as the control site. At each site, we install gStore
to find matches. We use MPICH-3.0.4 in C++ to join
the results generated by subqueries.

For a fair performance comparison, we use gStore and MPICH-3.0.4
to re-implement two recent distributed RDF fragmentation
strategies. The first one is SHAPE, which defines a vertex
and its neighbors as a triple group and assigns the triple groups
according to the values of their center vertices. SHAPE defines
several kinds of triple groups, and we use the subject-object-based
triple groups in this paper. The second one is WARP.
WARP first uses METIS to divide the RDF graph into fragments.
Then, it replicates all matches of a query pattern that cross
two fragments in one fragment. We use all frequent access patterns
to extend the fragments in WARP.

^1 http://km.aifb.kit.edu/projects/btc-2012/dbpedia/

8.2 Parameter Setting

Our frequent access pattern selection method uses one parameter:
minSup. In this subsection, we discuss how to set minSup to
optimize query processing. Note that, since the numbers of query
templates and queries per query template in WatDiv are specified
by users, the parameters can also be determined beforehand. Thus,
we only discuss how to set the parameters for DBPedia.

Given a workload $\mathcal{Q}$, we set the support threshold, minSup, to
find patterns whose access frequencies are larger than minSup.
Clearly, the smaller minSup is, the more frequent access patterns
there are. More frequent access patterns mean that
a query in the workload is more likely to contain
some frequent access patterns.

(a) minSup   (b) Workload Hitting Ratio

Figure 8: Effect of Frequent Access Patterns


Figure 8 shows the impact of minSup. As minSup increases,
the number of frequent access patterns (FAPs) decreases. When
we set minSup to 0.1% of the total number of queries in
the workload, there are 163 frequent access patterns for DBPedia.
When minSup is 1% of the total number of queries, the number
of frequent access patterns is reduced to 44 for DBPedia.
Furthermore, fewer frequent access patterns mean that fewer queries
in the workload are hit, as shown in Figure 8(b).

Even if we set minSup to 0.1% of the total number of queries,
the number of frequent access patterns is not large. Hence, in the
following, we set minSup to 0.1% of the total number of queries
for DBPedia by default.
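
As a quick illustration of what these ratios mean in absolute terms, the sketch below converts a minSup ratio into a query count for the DBPedia workload size reported in this section and filters a pattern table accordingly. The pattern names and access frequencies are made up; only the workload size comes from the text.

```python
TOTAL_QUERIES = 8_151_238  # DBPedia workload size reported in Section 8.1


def min_sup_threshold(total_queries, ratio):
    """Absolute access-frequency threshold for a minSup ratio."""
    return total_queries * ratio


def frequent_patterns(access_freq, total_queries, ratio):
    """Patterns whose access frequency is larger than minSup."""
    threshold = min_sup_threshold(total_queries, ratio)
    return sorted(p for p, freq in access_freq.items() if freq > threshold)


# Hypothetical access frequencies for three patterns.
access_freq = {"pA": 900_000, "pB": 40_000, "pC": 5_000}
low = frequent_patterns(access_freq, TOTAL_QUERIES, 0.001)   # 0.1% of queries
high = frequent_patterns(access_freq, TOTAL_QUERIES, 0.01)   # 1% of queries
```

At 0.1% the threshold is roughly 8,151 queries and at 1% roughly 81,512 queries, so raising minSup by a factor of ten drops every pattern accessed fewer than about 81,512 times.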

8.3 Throughput 

In this experiment, we test the throughput of different
fragmentation strategies. We sample 1% of all queries in the workload
and measure the throughput in queries per minute. Figure 9 shows the
number of queries answered in one minute by the different
fragmentation strategies.

For SHAPE and WARP, each query involves all fragments, so
queries are still processed sequentially. Since WARP is more
balanced than SHAPE, the throughput of WARP is a little better than
that of SHAPE. WARP can handle about 32 and 82 queries in one minute
for DBPedia and WatDiv, respectively, while SHAPE can handle 24 and
75 queries.

(a) DBPedia   (b) WatDiv

Figure 9: Throughput Comparison

For the vertical fragmentation strategy (VF), since a query often
contains only a few frequent access patterns, it only involves
a few fragments. Two queries involving different fragments can
be evaluated in parallel. Hence, about 46 queries and 533 queries
can be answered in one minute for DBPedia and WatDiv,
respectively. For the horizontal fragmentation strategy (HF), each
frequent access pattern specified by the query may map to many
structural minterm predicates, and the corresponding fragments of
these structural minterm predicates may be allocated to different
sites. Hence, the throughput of the horizontal fragmentation strategy
is a little worse than that of the vertical fragmentation strategy:
38 and 385 queries can be answered in one minute for DBPedia and
WatDiv, respectively.

8.4 Response Time 

In this experiment, we test the query performance of different
fragmentation strategies. We again sample 1% of all queries in the
workload and compute the average response time of a query.
Figure 10 shows the performance results.

SHAPE and WARP partition the RDF graph into subgraphs
and distribute these subgraphs among different sites. A query
has to be processed at many sites in parallel. SHAPE is
less balanced and sometimes needs cross-fragment joins, so SHAPE
needs about 2.5 and 0.79 seconds to answer a query for DBPedia
and WatDiv, respectively, while WARP takes 1.8 and 0.72 seconds.

For the vertical fragmentation strategy, only relevant fragments
are searched for matches and the search space is reduced.
Therefore, a query can be answered in about 0.8 seconds for DBPedia
and 0.3 seconds for WatDiv. For the horizontal fragmentation strategy,
we can filter out all irrelevant fragments mapping to the structural
minterm predicates not specified by the query, which further
reduces the search space. Hence, a query can be answered in about
0.6 seconds for DBPedia and 0.15 seconds for WatDiv.

Figure 10: Performance Comparison


8.5 Scalability Test 

In this experiment, we investigate the impact of dataset size on
our fragmentation strategies. We generate five WatDiv datasets
varying in size from 50 million to 250 million triples to test our
strategies. Figure 11 shows the results. Generally speaking, as the
size of the RDF dataset gets larger, the average response time of a
query increases and the number of queries answered in one minute
decreases accordingly. However, the rates of increase and decrease
are slow, so the query performance and throughput
scale with the RDF graph size on these datasets.

(a) Performance   (b) Throughput

Figure 11: Varying Size of Datasets

8.6 Redundancy 

Table 1 shows the redundancy ratio, i.e., the ratio of the number of
edges in all generated fragments to the total number of edges in the
original RDF graph, for each fragmentation strategy. For SHAPE, if a
fragment contains a vertex with high degree, all adjacent edges of the
high-degree vertex are introduced. Most of these introduced edges
are redundant, and they push the redundancy ratio of SHAPE to nearly
3 for DBPedia and 1.74 for WatDiv. WARP divides the RDF graph
while minimizing the edge cut, so the number of edges crossing
two fragments for WARP is smaller than the number for SHAPE.
Therefore, the redundancy ratio of WARP is smaller. Note that
WatDiv is much denser than DBPedia, so the minimum cut-set for
WatDiv contains a higher proportion of edges. Hence, the redundancy
ratio of WARP for WatDiv is 1.54, but its ratio for DBPedia is only
1.01.



          DBPedia   WatDiv
SHAPE     2.99      1.74
WARP      1.01      1.54
VF        1.38      1.04
HF        1.42      1.06

Table 1: Redundancy (Ratio to original dataset)

Our fragmentation strategies find and materialize some frequent
access patterns (or structural minterm predicates). As discussed
in Section 8.2, the number of frequent access patterns is limited.
Hence, the redundancy ratios of our fragmentation strategies are
limited. Note that the horizontal fragmentation strategy has a
slightly larger redundancy ratio than the vertical fragmentation
strategy. This is because different structural minterm predicates
derived from the same frequent access pattern share some common
triple patterns. These common triple patterns may cause more
redundant edges.

8.7 Offline Performance 

Table 2 shows the data partitioning and loading time of the datasets
for different fragmentation strategies. Although SHAPE has an
almost perfectly uniform distribution, its redundancy ratio is too
large and each fragment contains too many redundant edges. Hence,
loading fragments in SHAPE also takes much time. WARP uses
METIS. Since DBPedia is sparse (i.e. |E(G)|/|V(G)| « 1),
METIS can guarantee that there are only a few redundant edges and
that all fragments have a nearly uniform distribution. Thus, WARP has
less loading time than SHAPE. However, for WatDiv, the data graph
is dense (i.e. |E(G)|/|V(G)| » 1), so the fragmentation result of
METIS is unbalanced. Thus, WARP takes more loading time than
SHAPE to load the largest fragments.

Figure 12: Query Performance of Benchmark Queries

Since nearly half of all edges in DBPedia are infrequent edges,
loading the cold graph of DBPedia is the bottleneck in our
fragmentation strategies. However, in WatDiv, there are not so many
infrequent edges, so the loading time of our fragmentation strategies
for WatDiv is more acceptable. Note that, because the structural
minterm predicates are derived from the frequent access patterns,
the cold graphs for the vertical and horizontal fragmentation
strategies are the same. Thus, the loading times for the vertical and
horizontal fragmentation strategies are the same.



             DBPedia                           WatDiv
Strategies   Partitioning  Loading  Total      Partitioning  Loading  Total
SHAPE        41            30       71         20            19       49
WARP         43            28       71         33            46       79
VF           50            97       147        31            28       59
HF           58            97       139        34            28       62

Table 2: Partitioning and Loading Time (in min)

8.8 Experiments for Benchmark Queries 

In this experiment, we compare our methods with other
fragmentation strategies on the benchmark queries provided by WatDiv.
There are 20 benchmark queries in WatDiv, and these queries can be
classified into 4 structural categories: linear (L), star (S),
snowflake (F) and complex (C). Figure 12 shows the performance of the
different approaches. Generally speaking, our methods outperform
the other two methods in most cases. This is because each
benchmark query can be decomposed into some frequent access
patterns or structural minterm predicates. Hence, our fragmentation
strategies can filter out many irrelevant fragments. In contrast,
SHAPE and WARP always involve all fragments, and SHAPE further
needs some cross-fragment joins for complex queries.

Let us look deeper into Figure 12 and analyze each individual
fragmentation strategy. SHAPE has to involve all fragments for any
query, so its performance is always worse than that of our
fragmentation strategies. In particular, for star queries ($S_1$ to
$S_7$), the difference between the query response times of SHAPE and
our fragmentation strategies is not very large, because the
subject-object-based triple groups that we use guarantee that there
is no intermediate result and all star queries can be answered at
each fragment locally. However, for other shapes of queries, SHAPE has
to decompose the queries and do cross-fragment joins to merge the
intermediate results. Then, the performance of SHAPE decreases
greatly. Especially for the unselective queries ($L_1$, $F_1$, $F_2$,
$F_3$, $F_4$, $F_5$, $C_1$ and $C_2$), the performance of SHAPE is an
order of magnitude worse than that of our fragmentation strategies.

Since WARP also uses patterns to replicate triples to avoid
cross-fragment joins in complex queries, WARP has better performance
than SHAPE in most cases. However, WARP still always
involves all fragments at all sites for any kind of query. The
search space of WARP for a query is larger than that of our
fragmentation strategies. Thus, our fragmentation strategies always
result in better performance. Especially for the query with a very
complex structure ($C_2$), our fragmentation strategies can filter out
many irrelevant fragments, which results in a much smaller search
space than WARP. Hence, for $C_2$, our strategies are twice as fast as
WARP.

Since all benchmark queries are generated by instantiating benchmark
templates with actual RDF terms, these benchmark queries always
correspond to a limited number of minterm predicates. Hence,
the horizontal fragmentation is always faster than the vertical
fragmentation.

9. RELATED WORK 

For both general graphs and RDF graphs, as the graph size
grows beyond the capability of a single machine, many works
have been proposed on
graph fragmentation and allocation. We can divide all these methods
into two categories: global goal-oriented graph fragmentation
methods and local pattern-based graph fragmentation methods.

Global Goal-Oriented Graph Fragmentation. Methods of this kind
divide G into several fragments
while maximizing some goal function. They first transform a large
graph into a small graph; then, they apply some graph partitioning
algorithm on the small graph; finally, the partitions on the small
graph are projected back to the original graph. These methods often
apply existing methods (such as KL) directly on the
transformed graph in the second step. If we track the transforming
step, the partitions on the small graph can be easily projected back
to the original graph in the third step. Hence, the largest difference
among graph coarsening-based methods is how they coarsen
the original graph into a small graph.

In particular, METIS uses maximal matching to coarsen
the graph. A matching of a graph is a set of edges such that no two
edges share an endpoint. A maximal matching of a graph is a matching
to which no more edges can be added while remaining a matching.
GraphPartition directly applies METIS to the RDF graph. WARP
uses some frequent structures in the workload to further extend the
results of GraphPartition. EAGRE coarsens the RDF graph by
using the entity concept in RDF data. It considers an entity to be a
subject and its complete description. By grouping the entities of the
same class, an RDF graph can be compressed into a compressed RDF
entity graph. MLP designs a method to coarsen the graph by
label propagation. Vertices with the same label after the label
propagation are coarsened to a vertex in the coarsened graph. Sheep
transforms the graph into an elimination tree via a distributed
map-reduce operation, and then partitions this tree while reducing
communication volume. Tomaszuk et al. briefly survey how
to apply existing graph fragmentation solutions from graph
theory to RDF graphs.

Global goal-oriented graph fragmentation methods assume that 
if few edges cross different fragments, the communication cost is 
low. If an application involves nearly all vertices in the graph, 
few cross-fragment edges indeed result in little communication. A 
typical application suitable for graph coarsening-based methods is 
PageRank. 
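
The quantity these methods try to keep small can be made concrete 
with a short sketch. It assumes a fragmentation is given as a 
dictionary from vertex to fragment id; the name crossing_edges and 
the toy data are hypothetical. 

```python
def crossing_edges(edges, fragment_of):
    """Return the edges whose two endpoints lie in different fragments."""
    return [(u, v) for u, v in edges if fragment_of[u] != fragment_of[v]]


# Hypothetical toy graph (a 4-cycle) split into two fragments.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
fragment_of = {"a": 0, "b": 0, "c": 1, "d": 1}
cut = crossing_edges(edges, fragment_of)
```

In a computation that touches every vertex, such as PageRank, each 
crossing edge implies a message between sites per iteration, so the 
size of this list is a rough proxy for communication volume. 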

In some applications, no single static fragmentation fits all 
cases. Hence, Sedge maintains many fragmentations with different 
crossing edges, while Shang et al. move some vertices from one 
fragment to another during graph computing according to the 
workload. Yan et al. propose an indexing scheme based on 
fragmentation to help the query engine quickly locate the 
instances. 

Local Pattern-based Graph Fragmentation. Methods of this kind 
first find certain patterns as the fragmentation units to cover 
the whole graph; then, they distribute these patterns among sites. 
The local pattern-based methods mainly differ in their definitions 
of the fragmentation unit. 

HadoopRDF groups triples with the same property together, and 
each group corresponds to a fragmentation unit. All fragmentation 
units are then stored over HDFS. Yang et al. define some special 
query patterns, and subgraphs of a pattern are considered as a 
fragmentation unit. Lee et al. define the fragmentation unit as a 
vertex and its neighbors, which they call a triple group. The 
triple groups are distributed based on some heuristic rules. For 
each vertex, SketchCluster identifies the set of labeled vertices 
reachable within its one-hop neighborhood as its features and 
employs the KModes algorithm to group related vertices based on 
the features. Partout extends the concept of minterm predicates in 
relational database systems, and uses the results of minterm 
predicates as the fragmentation units. TriAD uses METIS to divide 
the RDF graph into many partitions. Then, each resulting partition 
is considered as a unit and distributed among different sites 
based on a hash function. PathPartitioning uses paths in RDF 
graphs as fragmentation units. 
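
Two of the unit definitions above can be sketched in a few lines of 
Python. This is a simplified illustration, not the code of HadoopRDF 
or of Lee et al.: triples are assumed to be (subject, property, 
object) tuples, and the names units_by_property and units_by_vertex 
as well as the sample triples are hypothetical. 

```python
from collections import defaultdict


def units_by_property(triples):
    """One unit per property: all triples sharing that property."""
    units = defaultdict(list)
    for s, p, o in triples:
        units[p].append((s, p, o))
    return dict(units)


def units_by_vertex(triples):
    """One unit per vertex: all triples incident to that vertex."""
    units = defaultdict(list)
    for s, p, o in triples:
        units[s].append((s, p, o))
        if o != s:  # avoid adding a self-loop twice
            units[o].append((s, p, o))
    return dict(units)


# Hypothetical toy RDF data.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "name", "Alice"),
]
by_property = units_by_property(triples)
by_vertex = units_by_vertex(triples)
```

Property-based units never overlap, whereas vertex-based units 
replicate each triple at both of its endpoints, which trades storage 
for locality. 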

Local pattern-based graph fragmentation methods assume that 
some real applications concern only a part of the whole graph. If 
an application concerns only the vertices of certain patterns, 
these methods access only the relevant fragments and thereby 
reduce the communication cost across fragments. A typical example 
application is subgraph homomorphism checking. 

10. CONCLUSION 

In this paper, we discuss how to manage a large RDF graph in a 
distributed environment. First, we mine and select some frequent 
access patterns to partition the RDF graph into many smaller 
fragments. Then, we propose an allocation algorithm to distribute 
all fragments over different sites. Last, we discuss how to process 
a query based on the results of fragmentation and allocation. 
Extensive experiments verify our approaches. 